{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --user graphviz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 12 - Decision Trees\n",
    "\n",
    "For this lab, we will use survey data collected by the city of [Somerville, MA](https://en.wikipedia.org/wiki/Somerville,_Massachusetts) asking residents about their happiness, as well as ratings of city services. \n",
    "\n",
    "The data is available from the UC Irvine Machine Learning Repository: [https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey](https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey)\n",
    "\n",
    "The link to download the data is [https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv)\n",
    "\n",
    "The data columns are:\n",
    "\n",
    "- D = decision attribute (D) with values 0 (unhappy) and 1 (happy) \n",
    "- X1 = the availability of information about the city services \n",
    "- X2 = the cost of housing \n",
    "- X3 = the overall quality of public schools \n",
    "- X4 = your trust in the local police \n",
    "- X5 = the maintenance of streets and sidewalks \n",
    "- X6 = the availability of social community events \n",
    "\n",
    "Attributes X1 to X6 have values 1 to 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn import tree\n",
    "import graphviz\n",
    "from graphviz import Source\n",
    " \n",
    "from sklearn.tree import export_graphviz\n",
    "import sklearn.metrics as met\n",
    "\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the data into a dataframe.  We have given the columns more descriptive names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_column_names = [\"happy\",\"city_info\",\"housing_cost\", \"school_quality\", \\\n",
    "                    \"trust_police\", \"streets_sidewalks\", \"community_events\"]\n",
    "city = pd.read_csv(\"SomervilleHappinessSurvey2015.csv\", \\\n",
    "                    encoding = \"utf-16le\",names = new_column_names, \\\n",
    "                    header = 0)\n",
    "city.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classwork\n",
    "\n",
    "The code belows allows you to make your own decision tree.  What three conditions should you use to get the highest accuracy?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# top level of decision tree\n",
    "filter_level_1 = city[\"school_quality\"] < 4\n",
    "level_2_left = city[filter_level_1]\n",
    "level_2_right = city[~filter_level_1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# second level of decision tree on left\n",
    "filter_level_2_left = level_2_left[\"housing_cost\"] < 4\n",
    "level_3_left_left = level_2_left[filter_level_2_left]\n",
    "level_3_left_right = level_2_left[~filter_level_2_left]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# second level of decision tree on right\n",
    "filter_level_2_right = level_2_right[\"community_events\"] < 4\n",
    "level_3_right_left = level_2_right[filter_level_2_right]\n",
    "level_3_right_right = level_2_right[~filter_level_2_right]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make predictions\n",
    "\n",
    "proportion_1 = level_3_left_left[\"happy\"].sum()/level_3_left_left.shape[0]\n",
    "if (proportion_1 >= 0.5):\n",
    "    confusion_matrix_left_left = confusion_matrix(level_3_left_left[\"happy\"],np.ones(level_3_left_left.shape[0]))\n",
    "else:\n",
    "    confusion_matrix_left_left = confusion_matrix(level_3_left_left[\"happy\"],np.zeros(level_3_left_left.shape[0]))\n",
    "\n",
    "proportion_1 = level_3_left_right[\"happy\"].sum()/level_3_left_right.shape[0]\n",
    "if (proportion_1 >= 0.5):\n",
    "    confusion_matrix_left_right = confusion_matrix(level_3_left_right[\"happy\"],np.ones(level_3_left_right.shape[0]))\n",
    "else:\n",
    "    confusion_matrix_left_right = confusion_matrix(level_3_left_right[\"happy\"],np.zeros(level_3_left_right.shape[0]))\n",
    "\n",
    "proportion_1 = level_3_right_left[\"happy\"].sum()/level_3_right_left.shape[0]\n",
    "if (proportion_1 >= 0.5):\n",
    "    confusion_matrix_right_left = confusion_matrix(level_3_right_left[\"happy\"],np.ones(level_3_right_left.shape[0]))\n",
    "else:\n",
    "    confusion_matrix_right_left = confusion_matrix(level_3_right_left[\"happy\"],np.zeros(level_3_right_left.shape[0]))\n",
    "\n",
    "\n",
    "proportion_1 = level_3_right_right[\"happy\"].sum()/level_3_right_right.shape[0]\n",
    "if (proportion_1 >= 0.5):\n",
    "    confusion_matrix_right_right = confusion_matrix(level_3_right_right[\"happy\"],np.ones(level_3_right_right.shape[0]))\n",
    "else:\n",
    "    confusion_matrix_right_right = confusion_matrix(level_3_right_right[\"happy\"],np.zeros(level_3_right_right.shape[0]))\n",
    "\n",
    "cm = confusion_matrix_left_left + confusion_matrix_left_right + confusion_matrix_right_left + \\\n",
    "                            confusion_matrix_right_right\n",
    "\n",
    "tn, fp, fn, tp = cm.ravel()\n",
    "\n",
    "sensitivity = tp/(tp + fn)\n",
    "specificity = tn/(tn + fp)\n",
    "precision = tp/(tp + fp)\n",
    "accuracy = (tp + tn)/(tp + tn + fp + fn)\n",
    "\n",
    "print(\"Sensitivity:\",sensitivity)\n",
    "print(\"Specificity:\",specificity)\n",
    "print(\"Precision:\", precision)\n",
    "print(\"Accuracy:\",accuracy)"
   ]
  },
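  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The unpacking `tn, fp, fn, tp = cm.ravel()` relies on how scikit-learn lays out a binary confusion matrix: rows are the true labels, columns are the predicted labels, and label 0 comes first. Here is a minimal sketch with made-up labels; `y_true` and `y_pred` are hypothetical and are not survey data.\n",
    "\n",
    "```python\n",
    "import sklearn.metrics as met\n",
    "\n",
    "# hypothetical labels, made up for illustration: 1 = happy, 0 = unhappy\n",
    "y_true = [0, 0, 0, 1, 1, 1, 1, 1]\n",
    "y_pred = [0, 1, 1, 1, 1, 1, 0, 1]\n",
    "\n",
    "cm = met.confusion_matrix(y_true, y_pred, labels=[0, 1])\n",
    "tn, fp, fn, tp = cm.ravel()   # row-major order: [[tn, fp], [fn, tp]]\n",
    "\n",
    "sensitivity = tp / (tp + fn)  # share of truly happy respondents predicted happy\n",
    "specificity = tn / (tn + fp)  # share of truly unhappy respondents predicted unhappy\n",
    "print(tn, fp, fn, tp)\n",
    "print(sensitivity, specificity)\n",
    "```\n",
    "\n",
    "Counting by hand on these eight labels gives 1 true negative, 2 false positives, 1 false negative, and 4 true positives, which is a quick way to check the ordering."
   ]
  },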
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fitting a decision tree with sci-kit learn\n",
    "\n",
    "We can get just the independent variables (x's) using the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city.iloc[:,1:7]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we created the decision tree classifier variable (object) and then fit it to our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = tree.DecisionTreeClassifier(max_depth = 2)\n",
    "clf = clf.fit(city.iloc[:,1:7], city[\"happy\"])"
   ]
  },
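  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `max_depth = 2`, the fitted tree can ask at most three questions and end in at most four leaves, the same shape as the hand-built tree in the classwork. The sketch below checks this on made-up data. The names `toy_X`, `toy_y`, and `toy_clf` are hypothetical and not part of the survey, and `get_depth` and `get_n_leaves` assume a reasonably recent scikit-learn.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn import tree\n",
    "\n",
    "# hypothetical toy data: six ratings (1 to 5) per respondent, made up for illustration\n",
    "rng = np.random.RandomState(0)\n",
    "toy_X = rng.randint(1, 6, size=(100, 6))\n",
    "toy_y = rng.randint(0, 2, size=100)\n",
    "\n",
    "toy_clf = tree.DecisionTreeClassifier(max_depth=2).fit(toy_X, toy_y)\n",
    "\n",
    "depth = toy_clf.get_depth()        # number of levels of questions actually used\n",
    "n_leaves = toy_clf.get_n_leaves()  # number of terminal groups\n",
    "print(depth, n_leaves)\n",
    "```"
   ]
  },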
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are running Jupyter Hub on your own computer, you may be able to display the decision tree by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree.plot_tree(clf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are using the Jupyter Hub server, run the following code (which will give an error):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dot_data = tree.export_graphviz(clf, out_file=None) \n",
    "graph = graphviz.Source(dot_data) \n",
    "graph.render(\"happiness.dot\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, despite the error, there should now be a file called happiness.dot in your directory.  To view the fitted decision tree, open the happiness.dot file in Jupyter and copy the text.  Paste this text into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) and click the \"Generate graph!\" button at the bottom.\n",
    "\n",
    "The column names have been replaced by `X[0], X[1], ..., X[5]`.  Run the following code to change `X[0], X[1], ..., X[5]` to the column names in happiness.dot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open (\"happiness.dot\", \"r\") as fin:\n",
    "    with open(\"happiness_fixed.dot\",\"w\") as fout:\n",
    "        for line in fin.readlines():\n",
    "            line = line.replace(\"X[0]\",\"city_info\")\n",
    "            line = line.replace(\"X[1]\",\"housing_cost\")\n",
    "            line = line.replace(\"X[2]\",\"school_quality\")\n",
    "            line = line.replace(\"X[3]\",\"trust_police\")\n",
    "            line = line.replace(\"X[4]\",\"streets_sidewalks\")\n",
    "            line = line.replace(\"X[5]\",\"community_events\")\n",
    "            fout.write(line)"
   ]
  },
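  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative to rewriting the .dot file, `export_graphviz` accepts a `feature_names` argument that puts the column names into the node labels directly. Here is a minimal sketch on made-up data; `toy_X`, `toy_y`, and `toy_clf` are hypothetical names, and the rule used to build `toy_y` is invented purely so the tree has something to learn.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn import tree\n",
    "\n",
    "feature_names = [\"city_info\", \"housing_cost\", \"school_quality\",\n",
    "                 \"trust_police\", \"streets_sidewalks\", \"community_events\"]\n",
    "\n",
    "# hypothetical toy data standing in for the survey: ratings from 1 to 5\n",
    "rng = np.random.RandomState(0)\n",
    "toy_X = rng.randint(1, 6, size=(100, 6))\n",
    "toy_y = (toy_X[:, 2] >= 4).astype(int)  # made-up rule based on the third column\n",
    "\n",
    "toy_clf = tree.DecisionTreeClassifier(max_depth=2).fit(toy_X, toy_y)\n",
    "\n",
    "# out_file=None returns the DOT text as a string instead of writing a file\n",
    "dot_data = tree.export_graphviz(toy_clf, out_file=None, feature_names=feature_names)\n",
    "print(dot_data)\n",
    "```\n",
    "\n",
    "The printed text can be pasted into the same web page as before, and its labels should show names such as `school_quality` rather than `X[2]`."
   ]
  },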
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy the contents of happiness.dot into the textbox in [http://www.webgraphviz.com](http://www.webgraphviz.com) to display the decision tree with the column names.  How does it compare the the decision tree you made?\n",
    "\n",
    "To make predictions, we can use the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = clf.predict(city.iloc[:,1:7])\n",
    "predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We compute the confusion matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "met.confusion_matrix(city[\"happy\"], predictions)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the true negatives, false positives, false negatives, and true positives:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tn, fp, fn, tp = met.confusion_matrix(city[\"happy\"], predictions).ravel()\n",
    "\n",
    "sensitivity = tp/(tp + fn)\n",
    "specificity = tn/(tn + fp)\n",
    "precision = tp/(tp + fp)\n",
    "accuracy = (tp + tn)/(tp + tn + fp + fn)\n",
    "\n",
    "print(\"Sensitivity:\",sensitivity)\n",
    "print(\"Specificity:\",specificity)\n",
    "print(\"Precision:\", precision)\n",
    "print(\"Accuracy:\",accuracy)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}